NSF PAR Search | NSF Public Access Repository

Scalable Breadth-First Search on a GPU Cluster

https://doi.org/10.1109/IPDPS.2018.00118

Pan, Yuechao; Pearce, Roger; Owens, John D. (May 2018, Proceedings of the 32nd IEEE International Parallel and Distributed Processing Symposium)

On a GPU cluster, the high ratio of computing power to communication bandwidth makes scaling breadth-first search (BFS) on a scale-free graph extremely challenging. By separating high and low out-degree vertices, we present an implementation with scalable computation and a model for scalable communication for BFS and direction-optimized BFS. Our communication model uses global reduction for high-degree vertices and point-to-point transmission for low-degree vertices. Leveraging the characteristics of degree separation, we reduce the graph size to one third of the conventional edge list representation. With several other optimizations, we observe linear weak scaling as we increase the number of GPUs, and achieve 259.8 GTEPS on a scale-33 Graph500 RMAT graph with 124 GPUs on the latest CORAL early access system.
Full Text Available
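
The degree separation described in the abstract above can be illustrated with a small single-process sketch. This is not the authors' implementation: the function names, the `threshold` parameter, and the toy edge list are all hypothetical, and no GPU or network communication is involved. The sketch only shows how vertices might be split by out-degree and how the updates from one BFS step would then be classified by the communication path the abstract describes (global reduction for high-degree destinations, point-to-point for low-degree ones).

```python
from collections import defaultdict


def separate_by_degree(edges, threshold):
    """Split vertices into high and low out-degree sets (threshold is an assumption)."""
    out_degree = defaultdict(int)
    for src, dst in edges:
        out_degree[src] += 1
        out_degree[dst] += 0  # register vertices that have no outgoing edges
    high = {v for v, d in out_degree.items() if d > threshold}
    low = set(out_degree) - high
    return high, low


def route_updates(edges, frontier, high):
    """Classify the destinations reached in one BFS step by how they would be sent."""
    reduce_updates = set()  # high-degree destinations: combined by a global reduction
    p2p_updates = set()     # low-degree destinations: sent point-to-point to the owner
    for src, dst in edges:
        if src in frontier:
            if dst in high:
                reduce_updates.add(dst)
            else:
                p2p_updates.add(dst)
    return reduce_updates, p2p_updates


# Toy directed graph: vertex 0 is the only hub.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 0), (2, 0), (3, 4), (4, 3)]
high, low = separate_by_degree(edges, threshold=2)
reduce_updates, p2p_updates = route_updates(edges, frontier={1, 3}, high=high)
print(high, low, reduce_updates, p2p_updates)
```

In a scale-free graph only a small fraction of vertices would land in the high-degree set, which is presumably what makes a bitmask-style global reduction over them affordable; the sketch makes no attempt to model that cost.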
Synchronous vs. Asynchronous GPU Graph Frameworks

Pan, Yuechao; Osama, Muhammad; Owens, John D. (April 2017, The 7th Workshop on Multi-core and Rack-scale Systems)

Recent node-level GPU-accelerated graph processing frameworks have separately chosen synchronous and asynchronous architectures. Which is better under which circumstances, and why? We focus on Gunrock (a synchronous framework) vs. Groute (an asynchronous framework) with three primitives on three different datasets. We identify load balance, kernel count, and communication latency and bandwidth as quantities of particular interest.
Full Text Available
Multi-GPU Graph Analytics

https://doi.org/10.1109/IPDPS.2017.117

Pan, Yuechao; Wang, Yangzihao; Wu, Yuduo; Yang, Carl; Owens, John D. (May 2017, Proceedings of the 31st IEEE International Parallel and Distributed Processing Symposium)

We present a single-node, multi-GPU programmable graph processing library that allows programmers to easily extend single-GPU graph algorithms to achieve scalable performance on large graphs with billions of edges. Directly using the single-GPU implementations, our design only requires programmers to specify a few algorithm-dependent concerns, hiding most multi-GPU-related implementation details. We analyze the theoretical and practical limits to scalability in the context of varying graph primitives and datasets. We describe several optimizations, such as direction-optimizing traversal and a just-enough memory allocation scheme, for better performance and smaller memory consumption. Compared to previous work, we achieve best-of-class performance across operations and datasets, including excellent strong and weak scalability on most primitives as we increase the number of GPUs in the system.
Full Text Available
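
The direction-optimizing traversal mentioned in the abstract above can be sketched on a single CPU as follows. This is a minimal illustration of the general idea, not the library's code: the function name, the `switch_fraction` heuristic, and the toy graph are assumptions, and the graph is taken to be undirected so that the same adjacency lists serve both directions. Small frontiers are expanded top-down (frontier vertices push to their neighbors), while large frontiers are expanded bottom-up (each unvisited vertex looks for any neighbor already in the frontier).

```python
def direction_optimizing_bfs(adj, source, switch_fraction=0.25):
    """Return BFS depths; switch_fraction is a simplified, hypothetical heuristic."""
    n = len(adj)
    depth = [-1] * n
    depth[source] = 0
    frontier = {source}
    level = 0
    while frontier:
        next_frontier = set()
        if len(frontier) > switch_fraction * n:
            # Bottom-up step: each unvisited vertex searches for a parent in the frontier.
            for v in range(n):
                if depth[v] == -1 and any(u in frontier for u in adj[v]):
                    next_frontier.add(v)
        else:
            # Top-down step: each frontier vertex visits its unvisited neighbors.
            for u in frontier:
                for v in adj[u]:
                    if depth[v] == -1:
                        next_frontier.add(v)
        level += 1
        for v in next_frontier:
            depth[v] = level
        frontier = next_frontier
    return depth


# Toy undirected graph given as adjacency lists.
adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
depths = direction_optimizing_bfs(adj, source=0)
top_down_only = direction_optimizing_bfs(adj, source=0, switch_fraction=1.0)
print(depths)
```

Both calls produce the same depths; the direction switch changes only how much work each level costs, which is why it can be applied as an optimization without altering the result.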
